-
Notifications
You must be signed in to change notification settings - Fork 12.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Initialize BTree nodes directly in the heap #81494
Conversation
Benchmark results on Ryzen 7 3800X running Fedora 33name bench.master ns/iter bench.node-init ns/iter diff ns/iter diff % speedup btree::map::clone_fat_100 110,276 76,469 -33,807 -30.66% x 1.44 btree::map::clone_fat_100_and_clear 109,863 76,634 -33,229 -30.25% x 1.43 btree::map::clone_fat_100_and_drain_all 175,486 139,632 -35,854 -20.43% x 1.26 btree::map::clone_fat_100_and_drain_half 150,244 120,815 -29,429 -19.59% x 1.24 btree::map::clone_fat_100_and_into_iter 126,147 90,100 -36,047 -28.58% x 1.40 btree::map::clone_fat_100_and_pop_all 185,214 149,954 -35,260 -19.04% x 1.24 btree::map::clone_fat_100_and_remove_all 209,143 175,649 -33,494 -16.01% x 1.19 btree::map::clone_fat_100_and_remove_half 163,341 134,098 -29,243 -17.90% x 1.22 btree::map::clone_fat_val_100 51,248 34,809 -16,439 -32.08% x 1.47 btree::map::clone_fat_val_100_and_clear 51,594 34,365 -17,229 -33.39% x 1.50 btree::map::clone_fat_val_100_and_drain_all 76,033 59,356 -16,677 -21.93% x 1.28 btree::map::clone_fat_val_100_and_drain_half 66,150 52,769 -13,381 -20.23% x 1.25 btree::map::clone_fat_val_100_and_into_iter 57,973 40,969 -17,004 -29.33% x 1.42 btree::map::clone_fat_val_100_and_pop_all 80,080 63,314 -16,766 -20.94% x 1.26 btree::map::clone_fat_val_100_and_remove_all 85,727 69,253 -16,474 -19.22% x 1.24 btree::map::clone_fat_val_100_and_remove_half 68,964 55,972 -12,992 -18.84% x 1.23 btree::map::clone_slim_100 1,423 1,323 -100 -7.03% x 1.08 btree::map::clone_slim_100_and_clear 1,431 1,328 -103 -7.20% x 1.08 btree::map::clone_slim_100_and_drain_all 2,752 2,612 -140 -5.09% x 1.05 btree::map::clone_slim_100_and_drain_half 2,312 2,102 -210 -9.08% x 1.10 btree::map::clone_slim_100_and_into_iter 1,428 1,070 -358 -25.07% x 1.33 btree::map::clone_slim_100_and_pop_all 2,981 2,798 -183 -6.14% x 1.07 btree::map::clone_slim_100_and_remove_all 4,183 4,026 -157 -3.75% x 1.04 btree::map::clone_slim_100_and_remove_half 2,360 2,312 -48 -2.03% x 1.02 btree::map::clone_slim_10k 187,223 179,762 -7,461 -3.99% x 1.04 btree::map::clone_slim_10k_and_clear 187,103 178,952 -8,151 -4.36% x 1.05 btree::map::clone_slim_10k_and_drain_all 328,127 309,945 -18,182 -5.54% x 1.06 btree::map::clone_slim_10k_and_drain_half 281,123 272,728 -8,395 -2.99% x 1.03 btree::map::clone_slim_10k_and_into_iter 187,027 150,913 -36,114 -19.31% x 1.24 btree::map::clone_slim_10k_and_pop_all 336,710 324,753 -11,957 -3.55% x 1.04 btree::map::clone_slim_10k_and_remove_all 508,120 490,692 -17,428 -3.43% x 1.04 btree::map::clone_slim_10k_and_remove_half 455,783 430,063 -25,720 -5.64% x 1.06 btree::map::find_rand_100 9 10 1 11.11% x 0.90 btree::map::find_rand_10_000 45 45 0 0.00% x 1.00 btree::map::find_seq_100 10 10 0 0.00% x 1.00 btree::map::find_seq_10_000 30 30 0 0.00% x 1.00 btree::map::first_and_last_0 9 9 0 0.00% x 1.00 btree::map::first_and_last_100 35 33 -2 -5.71% x 1.06 btree::map::first_and_last_10k 54 54 0 0.00% x 1.00 btree::map::insert_rand_100 39 38 -1 -2.56% x 1.03 btree::map::insert_rand_10_000 39 38 -1 -2.56% x 1.03 btree::map::insert_seq_100 44 41 -3 -6.82% x 1.07 btree::map::insert_seq_10_000 91 87 -4 -4.40% x 1.05 btree::map::iter_0 685 688 3 0.44% x 1.00 btree::map::iter_1 12,947 12,923 -24 -0.19% x 1.00 btree::map::iter_100 13,089 13,382 293 2.24% x 0.98 btree::map::iter_10k 13,630 13,843 213 1.56% x 0.98 btree::map::iter_1m 14,120 14,453 333 2.36% x 0.98 btree::map::iteration_1000 3,919 4,008 89 2.27% x 0.98 btree::map::iteration_100000 473,040 477,477 4,437 0.94% x 0.99 btree::map::iteration_20 73 74 1 1.37% x 0.99 btree::map::iteration_mut_1000 4,065 4,004 -61 -1.50% x 1.02 btree::map::iteration_mut_100000 468,690 434,082 -34,608 -7.38% x 1.08 btree::map::iteration_mut_20 78 73 -5 -6.41% x 1.07 btree::map::range_included_excluded 395,607 384,372 -11,235 -2.84% x 1.03 btree::map::range_included_included 407,530 380,624 -26,906 -6.60% x 1.07 btree::map::range_included_unbounded 110,565 125,029 14,464 13.08% x 0.88 btree::map::range_unbounded_unbounded 74,247 75,505 1,258 1.69% x 0.98 btree::map::range_unbounded_vs_iter 136,143 138,270 2,127 1.56% x 0.98 btree::set::clone_100 1,234 1,195 -39 -3.16% x 1.03 btree::set::clone_100_and_clear 1,250 1,199 -51 -4.08% x 1.04 btree::set::clone_100_and_drain_all 2,361 2,321 -40 -1.69% x 1.02 btree::set::clone_100_and_drain_half 1,911 1,829 -82 -4.29% x 1.04 btree::set::clone_100_and_into_iter 1,253 1,208 -45 -3.59% x 1.04 btree::set::clone_100_and_pop_all 1,674 1,634 -40 -2.39% x 1.02 btree::set::clone_100_and_remove_all 2,764 2,258 -506 -18.31% x 1.22 btree::set::clone_100_and_remove_half 1,817 1,738 -79 -4.35% x 1.05 btree::set::clone_10k 147,634 141,749 -5,885 -3.99% x 1.04 btree::set::clone_10k_and_clear 139,936 132,077 -7,859 -5.62% x 1.06 btree::set::clone_10k_and_drain_all 250,283 243,180 -7,103 -2.84% x 1.03 btree::set::clone_10k_and_drain_half 218,285 211,095 -7,190 -3.29% x 1.03 btree::set::clone_10k_and_into_iter 143,524 135,397 -8,127 -5.66% x 1.06 btree::set::clone_10k_and_pop_all 191,210 179,708 -11,502 -6.02% x 1.06 btree::set::clone_10k_and_remove_all 325,636 283,829 -41,807 -12.84% x 1.15 btree::set::clone_10k_and_remove_half 343,480 338,654 -4,826 -1.41% x 1.01 btree::set::difference_random_100_vs_100 676 700 24 3.55% x 0.97 btree::set::difference_random_100_vs_10k 2,393 2,356 -37 -1.55% x 1.02 btree::set::difference_random_10k_vs_100 49,785 48,707 -1,078 -2.17% x 1.02 btree::set::difference_random_10k_vs_10k 161,528 161,777 249 0.15% x 1.00 btree::set::difference_staggered_100_vs_100 615 605 -10 -1.63% x 1.02 btree::set::difference_staggered_100_vs_10k 2,123 2,163 40 1.88% x 0.98 btree::set::difference_staggered_10k_vs_10k 59,483 60,204 721 1.21% x 0.99 btree::set::intersection_100_neg_vs_100_pos 14 13 -1 -7.14% x 1.08 btree::set::intersection_100_neg_vs_10k_pos 15 15 0 0.00% x 1.00 btree::set::intersection_100_pos_vs_100_neg 13 13 0 0.00% x 1.00 btree::set::intersection_100_pos_vs_10k_neg 15 15 0 0.00% x 1.00 btree::set::intersection_10k_neg_vs_100_pos 17 14 -3 -17.65% x 1.21 btree::set::intersection_10k_neg_vs_10k_pos 16 16 0 0.00% x 1.00 btree::set::intersection_10k_pos_vs_100_neg 14 14 0 0.00% x 1.00 btree::set::intersection_10k_pos_vs_10k_neg 16 17 1 6.25% x 0.94 btree::set::intersection_random_100_vs_100 655 690 35 5.34% x 0.95 btree::set::intersection_random_100_vs_10k 2,013 2,153 140 6.95% x 0.93 btree::set::intersection_random_10k_vs_100 2,067 2,243 176 8.51% x 0.92 btree::set::intersection_random_10k_vs_10k 149,295 146,140 -3,155 -2.11% x 1.02 btree::set::intersection_staggered_100_vs_100 545 530 -15 -2.75% x 1.03 btree::set::intersection_staggered_100_vs_10k 1,953 1,983 30 1.54% x 0.98 btree::set::intersection_staggered_10k_vs_10k 53,822 53,558 -264 -0.49% x 1.00 btree::set::is_subset_100_vs_100 503 510 7 1.39% x 0.99 btree::set::is_subset_100_vs_10k 1,278 1,607 329 25.74% x 0.80 btree::set::is_subset_10k_vs_100 2 2 0 0.00% x 1.00 btree::set::is_subset_10k_vs_10k 51,933 51,803 -130 -0.25% x 1.00 It's a consistent win on clones, especially |
I rebased and tackled #80886 in the ssomers:btree-node-init branch (which I don't know how to formally offer because it doesn't compile on last week's master). I got the same headroom as using a similar benchmark improvement on an Intel Core i5-9500F running Windows x64
|
This looks overall good to me. I'm guessing there's a few places we dereference the Box during destruction or movement of the key/value pairs that could be changed to avoid putting keys and/or values on the stack too, but that can be tackled separately. @bors r+ rollup=never (likely to at least be perf noisy) |
📌 Commit d828de6280c5cd6e4ce24961af111e19f28c7668 has been approved by |
⌛ Testing commit d828de6280c5cd6e4ce24961af111e19f28c7668 with merge ce13e5d02f6cffd21da078a31544ffbf4b8a070e... |
💔 Test failed - checks-actions |
This comment has been minimized.
This comment has been minimized.
This PR needs to tackle #80886 such as in the ssomers:btree-node-init branch. I don't know how to formally offer such a change in this PR without mixing in a whole lot of changes such as #80886 itself. |
d828de6
to
a5c6a95
Compare
Thanks @ssomers -- I rebased this PR and cherry-picked your fix. |
a5c6a95
to
5a58cf4
Compare
@bors r+ rollup=never |
📌 Commit 5a58cf4 has been approved by |
☀️ Test successful - checks-actions |
It looks like this ended up being a pretty uniform regression on a number of benchmarks, including real world ones. It doesn't seem like it should be, since it's presumably doing strictly less work. I think it would be good to examine a cachegrind diff or similar to try to narrow down exactly where the extra cost is sneaking in so we can perhaps mitigate it. |
Right, that's my expectation too. I'll try to look into this... |
I wrote a standalone microbenchmark confirming that "strictly less work" takes ~3% more time.
The difference disappears when taking out the |
Could we also do that? There's no reason we can't initialize more memory, though it seems odd that a 16-bit write would be slower than 32 (or 64). It also seems like the difference you're seeing is potentially just noise, though hard to tell. Certainly those ranges overlap :) |
I did try to let the I'm not at all surprised a 16-bit write would be slower. Depending on CPU architecture (which I only knew little about last century) and instructions generated (which I know nothing about), it may need to first read 64 bits, surgically replace 16 bits, and write back 64 bits. |
I finally managed to get Godbolt to produce some readable comparison. The stack implementation copies 128 zero bits at once with some fancy modern 128 bit instructions, while the heap implementation carefully copies 64 and 32 of those. One wrong conclusion is that the difference is due to the example setting So I think the difference is that somehow the stack implementation realizes it has 64 bits of space to dump bits in, while the directly-on-the-heap implementation writes only 32 bits. Note that it doesn't limit to 16 bits, the size of There's no difference between heap and stack initialization (on my machine) if the initialized field is |
We can avoid any stack-local nodes entirely by using
Box::new_uninit
, and since the nodes are mostlyMaybeUninit
fields, we only need a couple of actual writes beforeassume_init
. This should help with the stack overflows in #81444, and may also improve performance in general.r? @Mark-Simulacrum
cc @ssomers